9
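Let us consider another dataset, Auto, which has 397 sample cars and a host of attributes. While reading the CSV file, we pass header=TRUE so that the first row is treated as column names, and na.strings="?" so that every "?" entry is read as a missing value (NA).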
In [1]:
Auto = read.csv("Datasets/Auto.csv", header=TRUE, na.strings="?")
head(Auto)
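To see what these two arguments do without the file at hand, here is a minimal sketch that reads a small inline CSV through the text argument of read.csv. The three rows and the name toy are made up for illustration and are not taken from Auto.csv.

```r
# Toy CSV text (made-up values, not from Auto.csv); "?" marks a missing entry
csv_text = "mpg,horsepower,name
18,130,car a
25,?,car b
22,95,car c"

# header = TRUE uses the first line as column names;
# na.strings = "?" turns every "?" field into NA
toy = read.csv(text = csv_text, header = TRUE, na.strings = "?")

print(toy)
print(is.na(toy$horsepower))       # TRUE only for the row that held "?"
print(is.numeric(toy$horsepower))  # the column is numeric because "?" became NA
```

Without na.strings="?", the "?" would be read as text and the whole horsepower column would no longer be numeric.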
Let us omit any rows with missing values.
In [2]:
print(paste("Total Number of Sample cars = ", dim(Auto)[1]))
Auto=na.omit(Auto)
print(paste("Number of sample cars after deleting incomplete sample data = ", dim(Auto)[1]))
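The effect of na.omit can be checked on a small example. The sketch below uses a made-up data frame (toy is a hypothetical name) and counts incomplete rows with complete.cases before dropping them.

```r
# Toy data frame (made-up values) with missing entries in two different rows
toy = data.frame(mpg = c(18, NA, 22, 30),
                 horsepower = c(130, 165, NA, 70))

# complete.cases() flags rows that have no NA in any column
n_incomplete = sum(!complete.cases(toy))
print(n_incomplete)

# na.omit() drops every row that contains at least one NA
toy_clean = na.omit(toy)
print(nrow(toy_clean))
print(rownames(toy_clean))  # original row labels are kept, so the labels have gaps
```

Because the row labels keep their original values, selecting rows by position (as in Auto[10, ]) and by label can refer to different rows after na.omit.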
The last column is just the name of the car samples. Can we use this to index each row, replacing the numerical index? No, we cannot. For a column to serve as the row index, its values must be unique. The table function allows us to create a frequency distribution table. Let us see which names are repeated.
In [3]:
n_occur = data.frame(table(Auto$name)) # This is a Frequency Distribution Table
n_occur[n_occur$Freq > 1,] #Get only duplicate entries
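A quicker uniqueness check is also possible with anyDuplicated. The following is a minimal sketch on made-up names (toy_names is a hypothetical vector, not the Auto data).

```r
# Made-up car names for illustration; two of them are repeated
toy_names = c("ford pinto", "amc matador", "ford pinto",
              "honda civic", "ford pinto", "amc matador")

# anyDuplicated() returns 0 when all values are unique,
# otherwise the position of the first repeated value
first_repeat = anyDuplicated(toy_names)
print(first_repeat)

# table() counts occurrences; keep only the names seen more than once
counts = table(toy_names)
print(counts[counts > 1])
```

A non-zero result from anyDuplicated is enough to rule the column out as a row index; the table is only needed to see which names repeat.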
Each of these names occurs more than once, so we will keep the numerical index and remove the name column (the last column).
In [4]:
car_names = Auto[,ncol(Auto)] #Store Car Names for later use
Auto = Auto[,-ncol(Auto)] #Remove the names column
head(Auto)
9 (a). Looking at the dataset, it isn't too difficult to determine the qualitative and quantitative variables. The following variables are categorical (qualitative): cylinders and origin.
NOTE: The field year can be considered categorical since there are only 13 years: 70 to 82 inclusive. However, for the same problem with different samples, there may be a wide range of years (too many to treat it as a categorical variable).
The remaining variables are quantitative.
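One rough way to support this kind of judgement is to count the distinct values in each column; a small count hints at, but does not prove, a categorical variable. The sketch below uses a made-up data frame (toy and n_distinct are hypothetical names).

```r
# Toy data frame (made-up values): one measured column and two coded columns
toy = data.frame(mpg = c(18.0, 15.5, 26.1, 31.9, 24.3, 20.2),
                 cylinders = c(8, 8, 4, 4, 6, 6),
                 origin = c(1, 1, 2, 3, 1, 3))

# Number of distinct values in each column
n_distinct = sapply(toy, function(x) length(unique(x)))
print(n_distinct)

# A coded column can be marked as qualitative by converting it to a factor
toy$origin = as.factor(toy$origin)
print(levels(toy$origin))
```

The count is only a heuristic: year has few distinct values in Auto and is still treated as quantitative in the parts that follow.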
9 (b) Let us determine the range of these quantitative predictors. The range of a field is given by its lowest and highest values across all samples. We will determine it in 3 ways: using min and max, using range, and using summary.
Let us first take a look at the min and max implementation.
In [13]:
variable = c("mpg", "displacement", "horsepower", "weight", "acceleration", "year")
minimum = c(min(Auto$mpg), min(Auto$displacement), min(Auto$horsepower), min(Auto$weight), min(Auto$acceleration), min(Auto$year))
maximum = c(max(Auto$mpg), max(Auto$displacement), max(Auto$horsepower), max(Auto$weight), max(Auto$acceleration), max(Auto$year))
tab = data.frame(variable, minimum, maximum)
tab
Now let us take a look at the values using R's range function.
In [14]:
range(Auto$mpg)
range(Auto$displacement)
range(Auto$horsepower)
range(Auto$weight)
range(Auto$acceleration)
range(Auto$year)
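The repeated calls can be collapsed with sapply, which applies a function to every column of a data frame. This is a minimal sketch on made-up values (toy and ranges are hypothetical names), not a replacement for the cells above.

```r
# Toy data frame (made-up values) standing in for the quantitative columns
toy = data.frame(mpg = c(18, 15, 26, 32),
                 weight = c(3504, 3693, 2130, 1800),
                 year = c(70, 70, 76, 82))

# sapply() applies range() to every column and binds the results into a matrix
# with one column per variable; row 1 holds the minimum, row 2 the maximum
ranges = sapply(toy, range)
rownames(ranges) = c("minimum", "maximum")
print(ranges)
```

On the real data the same call would be applied to only the quantitative columns, for example Auto[, variable].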
And now the same with the summary function. Pay attention to the Min. and Max. values.
In [7]:
summary(Auto)
Ignoring the summary for qualitative variables, we can read off the minimum and maximum values of each quantitative field directly. The values agree regardless of the function used.
9 (c) Let us determine the mean and standard deviation of each quantitative predictor.
In [15]:
mean = c(mean(Auto$mpg), mean(Auto$displacement), mean(Auto$horsepower), mean(Auto$weight), mean(Auto$acceleration), mean(Auto$year))
sd = c(sd(Auto$mpg), sd(Auto$displacement), sd(Auto$horsepower), sd(Auto$weight), sd(Auto$acceleration), sd(Auto$year))
tab = data.frame(variable, mean, sd)
tab
9 (d) Let us remove the 10th to the 85th observation and observe the range, mean, and standard deviation for the data that remains.
In [16]:
sub_Auto = Auto[-c(10:85),] #Remove 10th to 85th row, keep all columns
minimum = c(min(sub_Auto$mpg), min(sub_Auto$displacement), min(sub_Auto$horsepower), min(sub_Auto$weight), min(sub_Auto$acceleration), min(sub_Auto$year))
maximum = c(max(sub_Auto$mpg), max(sub_Auto$displacement), max(sub_Auto$horsepower), max(sub_Auto$weight), max(sub_Auto$acceleration), max(sub_Auto$year))
mean = c(mean(sub_Auto$mpg), mean(sub_Auto$displacement), mean(sub_Auto$horsepower), mean(sub_Auto$weight), mean(sub_Auto$acceleration), mean(sub_Auto$year))
sd = c(sd(sub_Auto$mpg), sd(sub_Auto$displacement), sd(sub_Auto$horsepower), sd(sub_Auto$weight), sd(sub_Auto$acceleration), sd(sub_Auto$year))
tab = data.frame(variable, minimum, maximum, mean, sd)
tab
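Since the same four statistics are computed for both the full and the reduced data, the computation can be wrapped in a small helper. The function describe below is a hypothetical name introduced for this sketch, and the data are made up.

```r
# Hypothetical helper: one row per variable with its range, mean and standard deviation
describe = function(df) {
  data.frame(variable = names(df),
             minimum = sapply(df, min),
             maximum = sapply(df, max),
             mean = sapply(df, mean),
             sd = sapply(df, sd),
             row.names = NULL)  # use plain integer row names
}

# Toy data (made-up values)
toy = data.frame(mpg = c(10, 20, 30, 40, 50),
                 weight = c(5000, 4000, 3000, 2000, 1000))

full = describe(toy)
reduced = describe(toy[-c(2:3), ])  # negative indices drop rows 2 to 3 by position

print(full)
print(reduced)
```

On the real data the calls would be describe(Auto[, variable]) and describe(Auto[-c(10:85), variable]), which makes the two tables directly comparable.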
9 (e) Let us generate scatter plots among predictors.
In [17]:
pairs(Auto[, c("mpg", "displacement", "horsepower", "weight", "acceleration", "year")]) # Scatterplot matrix of the quantitative columns
Some observations based on the scatterplot matrix above:
9 (f) As mentioned in the first point of 9(e), there exists a negative correlation between mpg and the 3 predictors: displacement, horsepower and weight. Additionally, we see that newer vehicles have higher mpg values (evidenced by the mpg vs year scatterplot). Given enough data points, we may be able to make decent predictions of mpg with displacement, horsepower, weight and year as predictors.
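The visual impression from the scatterplots can be backed by numbers using cor, which computes Pearson correlations. The sketch below runs on made-up values (toy and r are hypothetical names), so the magnitudes say nothing about the Auto data; only the technique carries over.

```r
# Toy data (made-up values): mpg rises while weight falls and year increases
toy = data.frame(mpg = c(15, 18, 24, 28, 33, 36),
                 weight = c(4300, 3900, 3100, 2600, 2100, 1900),
                 year = c(70, 72, 75, 77, 80, 82))

# cor() on a data frame returns the full correlation matrix;
# keep the mpg row to see how mpg relates to each other column
r = cor(toy)["mpg", c("weight", "year")]
print(r)
```

A value near -1 or 1 indicates a strong linear relationship, while a curved pattern in the scatterplot can produce a weaker value than the plot suggests.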